feat: Ignore DPUs, so that they can be externally managed #85
vinodchitraliNVIDIA wants to merge 1 commit into NVIDIA:main from
Conversation
Force-pushed: 2a94f1f to 68e2edc
🛡️ Vulnerability Scan: 🚨 Found 30 vulnerabilities.
💡 Note: Enable GitHub Advanced Security to see full details in the Security tab.
🔐 TruffleHog Secret Scan: ✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉
🛡️ CodeQL Analysis: ✅ No security issues found!
@vinodchitraliNVIDIA could you make sure to write up a description of your changes? This is hard to review because I have zero context for what this change is for. Particularly, why is this change needed? When doing zero-DPU testing we tested with hosts that had onboard NICs and it worked fine. I'm confused why there need to be any changes here (I don't doubt there do, but I just want to understand what's lacking in the current code.)
In the zero-DPU case, there is physically no DPU on the host. In our use case, DPU(s) physically exist on the host: they are assigned a machine interface, but they are not made part of the Managed Host, because DPU configuration is skipped. The use case looks more or less the same as the zero-DPU case, except that DPU(s) may or may not exist.
Force-pushed: 68e2edc to 30d4a00
we should find a way to support the use-case without building yet another feature. I'm concerned that even the zero-DPU path has no docs and has barely seen any testing. This has even less (no unit tests), so it's very likely to just be forgotten and break. Any chance we can get this aligned with regular zero-DPU? E.g. remove the DPUs from the target hosts, or somehow set them in NIC mode and then just use the NIC that the host boots from? I also feel like this change might just be the very beginning of supporting such hosts: we'd also need to look at the inventory path, SKUs and SKU validation, software updates, etc.
@vinodchitraliNVIDIA do you mean the DPUs are in NIC mode? Because the zero-DPU case already covers this (if we discover a host and its DPU is in NIC mode, it's treated as a zero-DPU host; we even have integration tests for this). Or do you mean these are DPF-managed hosts or something else, where we see a DPU, it's not in NIC mode, but carbide opts not to manage the DPU in favor of something else managing it? If so, it's probably worth (a) spelling that out in the PR description, and (b) renaming some of the things here to indicate "externally managed DPU" instead of "onboard NIC"
DPUs can be in any mode. Let me think through "externally managed DPU". Initially I used
The current zero-DPU code path assumes that the host will not have an attached DPU, @kensimon correct me if I am wrong. In our case DPUs will be attached to the host, and carbide DHCP will assign them IPs, but they are not configured in the managed host. Let me add a few test cases
Force-pushed: 30d4a00 to 761b7b6
The case when DPU(s) physically exist on a host: they are assigned a machine interface, but they are not part of the Managed Host. This PR skips DPU configuration, which allows DPUs to stay outside the machine's lifecycle workflow. Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
Force-pushed: 761b7b6 to c903c8c
@vinodchitraliNVIDIA this is not the case: systems can have DPUs in what is called "NIC mode", which means they show up as a "dumb NIC" on the device and are not something we put any of our code on. This is a BIOS setting you have to set on the system. If it's set, we don't count it as a "DPU" on the device; it's just a regular NIC. So the zero-DPU code paths apply if the only DPUs on the host are NIC-mode. But a NIC-mode DPU means nobody else can put code on it either... so putting a DPU in NIC mode means it can't be DPF-managed either.
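The classification rule described above can be sketched in Rust. This is illustrative only: `DeviceKind` and `is_zero_dpu_host` are hypothetical names, not carbide code; the point is just that a NIC-mode DPU counts as a plain NIC, so a host qualifies for the zero-DPU path only when none of its devices is a manageable DPU.

```rust
/// Hypothetical device classification, assumed for illustration.
#[derive(Clone, Copy, PartialEq)]
enum DeviceKind {
    Nic,          // ordinary NIC
    Dpu,          // DPU that carbide would normally manage
    DpuInNicMode, // DPU with the NIC-mode BIOS setting: treated as a plain NIC
}

/// A host takes the zero-DPU path when none of its devices is a
/// *manageable* DPU; NIC-mode DPUs count as plain NICs.
fn is_zero_dpu_host(devices: &[DeviceKind]) -> bool {
    devices.iter().all(|d| *d != DeviceKind::Dpu)
}
```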
        return Ok(false);
    }
    tracing::info!("Created managed_host with zero DPUs");
} else if self.config.use_onboard_nic.load(Ordering::Relaxed) {
I still don't understand what this change is supposed to do. If we've gotten to this point in site_explorer, we've seen no DPUs on the host, and if zero-DPU configuration is allowed, we already ingest it with no DPUs.
Why do we need another config setting called use_onboard_nic, and another function create_onboard_nic_machine? What do these do that are different from the zero-dpu path?
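For illustration only, here is a minimal Rust sketch of how the two flag-gated branches might relate. The `use_onboard_nic` flag and the `Ordering::Relaxed` load come from the diff in this PR; everything else (`ExplorerConfig`, `IngestPath`, `choose_path`, and an assumed `allow_zero_dpu` flag) is a hypothetical name invented here, not carbide code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical slice of the site_explorer config; the real struct is larger.
struct ExplorerConfig {
    allow_zero_dpu: AtomicBool,  // assumed flag for the existing zero-DPU path
    use_onboard_nic: AtomicBool, // flag added by this PR
}

// Which ingestion path a discovered host takes (illustrative variants).
#[derive(Debug, PartialEq)]
enum IngestPath {
    ManagedDpus,       // normal path: carbide provisions the DPUs
    ZeroDpu,           // no manageable DPUs visible on the host
    ExternallyManaged, // DPUs present but skipped; host boots from its NIC
}

fn choose_path(cfg: &ExplorerConfig, dpu_count: usize) -> IngestPath {
    if dpu_count == 0 && cfg.allow_zero_dpu.load(Ordering::Relaxed) {
        IngestPath::ZeroDpu
    } else if cfg.use_onboard_nic.load(Ordering::Relaxed) {
        // The PR's new branch: DPUs exist but their configuration is skipped.
        IngestPath::ExternallyManaged
    } else {
        IngestPath::ManagedDpus
    }
}
```

The reviewer's question amounts to: the first two branches overlap when `dpu_count == 0`, so what does the second branch provide that the first does not?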
@kensimon is there any known bug? A couple of months ago I tried with the zero-DPU flag and it didn't work: the GB200 machine has multiple MAC addresses, and I hit an issue there. Also, the DPU list in the managed host is not empty.
Not that I know of? If there is a bug that you found, we should fix it. I don't think we need a fully separate code path for GB200s when we can just fix the current one.
If you have an issue with GB200s in zero-DPU mode, could you file an nvbug and assign it to me? I need logs/details/reproduction steps if possible.
@vinodchitraliNVIDIA can you schedule a meeting to go over how this is different from the zero-DPU code path? Just like 15 minutes to explain what issues you're running into. Zero-DPU ideally should work with any type of NIC on the machine.
Sure, will do. Let me set up an env with zero DPU using the latest code from GitHub.
Description
The case when DPU(s) physically exist on a host: they are assigned a machine interface, but they are not part of the Managed Host. This PR skips DPU configuration and forces the host to boot from the NIC.
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes